AUDIO VISUAL DETECTION OF VOICE ACTIVITY FOR SPEECH RECOGNITION SYSTEM

The present invention relates generally to automatic speech recognition 
systems and methods and more particularly to an automatic speech recognition system and 
method wherein an automatic speech recognizer only responsive to acoustic speech 
utterances is activated only in response to acoustic energy having a spectrum associated with 
the speech utterances and at least one facial characteristic associated with the speech 
utterances. 

Currently available speech recognition systems determine the beginning and
end of utterances by responding only to the presence and absence of acoustic energy having a
spectrum associated with the utterances. If a microphone associated with the speech
recognition system is in an acoustically noisy environment including, for example, speakers
other than the speaker whose voice is to be recognized or activated machinery, including
telephones (particularly ringing telephones), the noise limits the system performance. Such
speech recognition systems attempt to correlate the acoustic noise with words they have learned
for a particular speaker, resulting in the speech recognition system producing an output that is
unrelated to any utterance of the speaker whose voice is to be recognized. In addition, the
speech recognition system may respond to the acoustic noise in a manner having an adverse 
effect on its speech learning capabilities. 

We are aware that the prior art has considered the problems associated with an 
acoustically noisy environment by detecting acoustic energy and facial characteristics of a 
speaker whose voice is to be recognized. For example, Maekawa et al., U.S. Patent
5,884,257, and Stork et al., U.S. Patent 5,621,858, disclose voice recognition systems that
respond to acoustic energy of a speaker, as well as facial characteristics associated with
utterances by the speaker. In Maekawa et al., lip movement is detected by a visual system
including a light source and light detector. The system includes a speech period detector 
which derives a speech period signal by detecting the strength and duration of the movement 
of the speaker's lips. The system also includes a voice recognition system and an overall 
judgment section which determines the content of an utterance based on the acoustic energy
in the utterance and movement of the lips of the speaker. In Stork et al., lip, nose and chin
movements are detected by a video camera. A spectrum analyzer responsive
to acoustic energy and a position vector generator responsive to the video camera supply
signals to a speech classifier trained to recognize a limited set of speech utterances based on
the output signals of the spectrum analyzer and position vector generator.

In both Maekawa et al. and Stork et al., complete speech recognition is 
performed in parallel to image recognition. Consequently, the speech recognition processes
of these prior art devices would appear to be somewhat slow and complex, and to require
a significant amount of power, such that the devices do not appear to be particularly
well-suited as remote control devices for controlling equipment.



In accordance with one aspect of the present invention, a speech recognition 
system comprises (1) an acoustic detector for detecting speech utterances of a speaker, (2) a
visual detector for detecting at least one facial characteristic associated with speech
utterances of the speaker, and (3) a processing arrangement connected to be responsive to the 
acoustic and visual detectors for deriving a signal. The signal has first and second values 
respectively indicative of the speaker making and not making speech utterances such that the 
first value is derived only in response to the acoustic detector detecting a finite, nonzero
acoustic response while the visual detector detects at least one facial characteristic associated
with speech utterances of the speaker. A speech recognizer for deriving an output indicative 
of the speech utterances as detected only by the acoustic detector is connected to be 
responsive to the acoustic detector only while the signal has the first value. 

Another aspect of the invention relates to a method of recognizing speech
utterances of a speaker with an automatic speech recognizer only responsive to acoustic
speech utterances of the speaker. The method comprises: (1) detecting acoustic energy
having a spectrum associated with speech utterances, (2) detecting at least one facial
characteristic associated with speech utterances of the speaker, and (3) activating the 
automatic speech recognizer only in response to the detected acoustic energy having a
spectrum associated with speech utterances while the at least one facial characteristic
associated with speech utterances of the speaker is occurring. 

Preferably, activation of the automatic speech recognizer is prevented in 
response to any of: (1) no acoustic energy having a spectrum associated with speech 
utterances being detected while no facial characteristic associated with speech utterances of
the speaker is detected, (2) acoustic energy having a spectrum associated with speech
utterances being detected while no facial characteristic associated with speech utterances of
the speaker is detected, and (3) no acoustic energy having a spectrum associated with speech
utterances being detected while at least one facial characteristic associated with speech
utterances of the speaker is detected.
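
By way of non-limiting illustration, the activation rule and the three blocking cases listed above amount to a logical AND of the two detections. The following Python sketch is offered only as an aid to understanding; the function name recognizer_enabled and its boolean arguments are hypothetical and do not appear in the disclosure.

```python
from itertools import product


def recognizer_enabled(speech_energy_detected: bool, facial_cue_detected: bool) -> bool:
    """Return True only when both cues are present at the same time.

    Hypothetical helper: the arguments stand for the outputs of the
    acoustic and visual detections described in the text.
    """
    return speech_energy_detected and facial_cue_detected


def truth_table():
    """List every input combination with the resulting activation state."""
    return [
        (acoustic, visual, recognizer_enabled(acoustic, visual))
        for acoustic, visual in product([False, True], repeat=2)
    ]
```

Only the combination in which both inputs are true enables the recognizer; the other three combinations correspond to cases (1) to (3) above.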

In the preferred embodiment, the beginning of each speech utterance is
assuredly coupled to the speech recognizer by: (a) delaying the speech utterance, (b)
recognizing the beginning of each speech utterance, and (c) responding to the recognized
beginning of each speech utterance to couple the delayed speech utterance associated with
the beginning of each speech utterance to the speech recognizer and thereafter sequentially
coupling the remaining delayed speech utterances to the speech recognizer. Assurance that
no detected acoustic energy is coupled to the speech recognizer upon the completion of a
speech utterance is provided by: (a) delaying the
acoustic energy associated with the speech utterance, (b) recognizing the completion of each
speech utterance, and (c) responding to the recognized completion of each speech utterance
to decouple delayed acoustic energy occurring after the completion of each speech utterance
from the speech recognizer.

In the preferred apparatus embodiment, the delay is provided by a ring buffer
that is effectively indexed so that segmented detected acoustic energy at the beginning of the
utterance and segmented detected acoustic energy at the end of the utterance and segmented
detected acoustic energy between the beginning and end of the utterance are coupled to the
speech recognizer to the exclusion of acoustic energy prior to the beginning of the utterance
and acoustic energy subsequent to the end of the utterance.
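
As a minimal sketch of an indexed ring buffer of this kind, the following Python fragment keeps only the most recent segments and returns those lying between a marked beginning and a marked end. The class name SegmentRingBuffer, the running index, and the use of collections.deque are assumptions made for illustration; the disclosure does not prescribe any particular software structure.

```python
from collections import deque


class SegmentRingBuffer:
    """Fixed-capacity buffer that keeps only the most recent segments."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen silently discards the oldest entry when full.
        self._segments = deque(maxlen=capacity)
        self._next_index = 0  # running count of segments written so far

    def write(self, segment) -> int:
        """Store one segment and return the running index assigned to it."""
        index = self._next_index
        self._segments.append((index, segment))
        self._next_index += 1
        return index

    def read_range(self, first_index: int, last_index: int) -> list:
        """Return the segments first_index..last_index (inclusive) still held."""
        return [
            segment
            for index, segment in self._segments
            if first_index <= index <= last_index
        ]


# Hypothetical usage: two utterance segments surrounded by noise segments.
ring = SegmentRingBuffer(capacity=8)
for name in ["noise0", "noise1", "hel", "lo", "noise2"]:
    ring.write(name)
utterance = ring.read_range(2, 3)  # indices marked at utterance start and end
```

Segments before index 2 and after index 3 remain in the buffer but are excluded from what is handed on, which mirrors the exclusion of acoustic energy before and after the utterance.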

The processing arrangement in first and second embodiments respectively
includes a lip motion detector and a face recognizer. The face recognizer is preferably arranged for
enabling the signal to have the first value only in response to the face of the speaker being at
a predetermined orientation relative to the visual detector. The face recognizer also
preferably: (1) detects and distinguishes the faces of a plurality of speakers, and (2) enables
the signal to have the first value only in response to the speaker having a recognized face.

In the second embodiment, the processing arrangement also includes a speaker 
identity recognizer for: (1) detecting and distinguishing speech patterns of a plurality of
speakers, and (2) enabling the signal to have the first value only in response to the speaker
having a recognized speech pattern. 

The above and still further objects, features and advantages of the present 
invention will become apparent upon consideration of the following detailed description of a 
specific embodiment thereof, especially when taken in conjunction with the accompanying 
drawing. 

Figure 1 is a block diagram of a preferred embodiment of a speech
recognition system in accordance with the present invention; and

Figure 2 is a block diagram of a modified portion of the speech recognition 
system of Figure 1. 

Reference is now made to Figure 1 of the drawing, wherein microphone 10
and video camera 12 are respectively responsive to acoustic energy in a spectrum including 
utterances of a speaker and optical energy associated with at least one facial characteristic, 
particularly lip motion, of utterances by the speaker. Microphone 10 and camera 12 
respectively derive electrical signals that are replicas of the acoustic and optical energy 
incident on them in the spectra they are designed to handle. 

The electrical output signal of microphone 10 drives analog to digital 
converter 14 which in turn drives acoustic energy detector circuit 16 and speech segmentor 
circuit 18 in parallel. Acoustic energy detector 16 derives a bi-level output signal having a 
true value in response to the digital output signal of converter 14 having a value indicating 
that acoustic energy above a predetermined threshold is incident on microphone 10. Speech 
segmentor 18 derives a digital signal that is divided into sequential speech segments, such as 
phonemes, for utterances of the speaker speaking into microphone 10. 
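
The bi-level behavior of acoustic energy detector 16 can be pictured with the following Python sketch. It is illustrative only: the text states merely that the detector responds to acoustic energy above a predetermined threshold, so the mean-square energy measure, the frame-based processing, and the function name energy_above_threshold are all assumptions.

```python
def energy_above_threshold(samples, threshold: float) -> bool:
    """Return True when the mean-square energy of one frame exceeds threshold.

    samples:   digitized amplitude values for one frame (assumed input format)
    threshold: predetermined energy level (units follow the sample scale)
    """
    if not samples:
        return False  # an empty frame carries no energy
    mean_square = sum(s * s for s in samples) / len(samples)
    return mean_square > threshold
```

Note that such a detector alone cannot distinguish speech from ambient noise, which is why its output is combined with a visual cue in the system described.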

Speech segmentor 18 supplies the sequential speech segments in parallel to 
random access memory (RAM) 22 and dynamic ring buffer 24. RAM 22 includes an enable 
input terminal 23 connected to be responsive to the bi-level output signal of acoustic energy 
detector 16. In response to energy detector 16 deriving a true value, as occurs when 
microphone 10 is responsive to a speaker making an utterance or ambient noise, RAM 22 is 
enabled to be responsive to the output of speech segmentor 18. When enabled, sequential 
memory locations, i.e., addresses, in RAM 22 are loaded with the sequential segments that
segmentor 18 derives by virtue of a data input of the RAM being connected to the segmentor
output. This is true regardless of whether the sequential segments are speech utterances or
noise. RAM 22 has sufficient capacity to store the sequential speech segments of a typical
utterance by the speaker as segmentor 18 is deriving the segments so that the first and last
segments of a particular utterance, or noise, are stored at predetermined addresses in the
RAM. 

Dynamic ring buffer 24 includes a sufficiently large number of stages to store 
the sequential speech segments segmentor 18 derives for a typical utterance. Thus, buffer 24 
effectively continuously records and maintains the last few seconds of acoustic energy
supplied to microphone 10. RAM 22 and circuitry associated with it form a processing
arrangement that effectively indexes dynamic ring buffer 24 to indicate when the first and 
last segments of utterances by the speaker who is talking into microphone 10 occur. If the 
acoustic energy incident on microphone 10 is not associated with an utterance, dynamic ring 
buffer 24 is not effectively indexed. Buffer 24 is part of a delay arrangement for assuring
that (1) the beginning of each speech utterance is coupled to a speech recognizer and (2) upon
completion of each utterance the speech recognizer is no longer responsive to a signal 
representing acoustical energy. 

To perform indexing of buffer 24 only in response to utterances by the speaker 
who is talking into microphone 10, the system illustrated in Figure 1 detects at least one
facial characteristic associated with speech utterances of the speaker while acoustic energy is
incident on microphone 10. The facial characteristic detected in the embodiment of Figure 1 is
lip motion. To this end, video camera 12 derives a signal indicative of lip
motion of the speaker speaking into microphone 10. The lip motion signal that camera 12
derives drives lip motion detector 26, which derives a bi-level signal having a true value while
lip motion detector 26 senses that the lips of the speaker are moving and a zero value while
lip motion detector 26 senses that the lips of the speaker are not moving.

The bi-level output signals of acoustic energy detector 16 and lip motion detector
26 drive AND gate 28, which derives a bi-level signal having a true value only while the
bi-level output signals of detectors 16 and 26 both have true values. Thus, AND gate 28 derives
a true value only while microphone 10 and camera 12 are responsive to speech utterances by
the speaker; at all other times, the output of AND gate 28 has a zero, i.e., not true, value.

The output signal of AND gate 28 drives one shot circuits 30 and 32 in 
parallel. One shot 30 derives a short duration pulse in response to the leading edge of the 
output signal of AND gate 28, i.e., in response to the output of the gate having a transition
from the zero value to the true value. One shot 32 derives a short duration pulse in response
to the trailing edge of the output signal of AND gate 28, i.e., in response to the output of the 
gate having a transition from the true value to the zero value. Hence, one shot circuits 30 and 
32 respectively derive short duration pulses only at the beginning and end of a speech
utterance. One shot circuits 30 and 32 do not derive any pulses if (1) acoustic energy
detector 16 derives a true value while lip motion detector 26 derives a zero value, (2) lip
motion detector 26 derives a true value while acoustic energy detector 16 derives a zero
value, or (3) neither detector 16 nor detector 26 derives a true value.
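
The combined action of AND gate 28 and one shot circuits 30 and 32 can be sketched in software as edge detection on the gated signal. The following Python fragment is a simplified illustration operating on sampled bi-level sequences; the function names and_gate and edge_pulses, and the sampled representation itself, are assumptions and not part of the described circuitry.

```python
def and_gate(acoustic_levels, lip_levels):
    """Sample-by-sample AND of two bi-level sequences (models gate 28)."""
    return [bool(a) and bool(l) for a, l in zip(acoustic_levels, lip_levels)]


def edge_pulses(levels):
    """Return (sample_index, kind) for each transition of a bi-level sequence.

    kind is "start" for a zero-to-true transition (one shot 30) and
    "end" for a true-to-zero transition (one shot 32).
    """
    events = []
    previous = False  # the signal is assumed to start at the zero value
    for index, current in enumerate(levels):
        if current and not previous:
            events.append((index, "start"))
        elif previous and not current:
            events.append((index, "end"))
        previous = current
    return events


# Hypothetical example: sound is present for longer than the lips move.
acoustic = [0, 1, 1, 1, 1, 0]
lips = [0, 0, 1, 1, 0, 0]
pulses = edge_pulses(and_gate(acoustic, lips))
```

In this example the pulses mark only the interval in which both detectors are true; sound without lip motion, or lip motion without sound, produces no pulse at all.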

The output pulses of one shot circuits 30 and 32 are supplied as write enable
signals to first and second predetermined addresses of RAM 22. The first and second
addresses are respectively for the first and last speech segments that segmentor 18 derives for
a particular utterance. Hence, the first address stores the first speech segment that segmentor
18 derives for a particular utterance, while the second address stores the last speech segment
that segmentor 18 derives for that same utterance. RAM 22 is enabled to be responsive to the
sequential segments that segmentor 18 derives and the output signals of one shot circuits 30
and 32 by virtue of acoustic energy detector 16 supplying the RAM enable input terminal 23
with a true value during the speech utterance. RAM 22 responds to a transition of the output 
of acoustic energy detector 16 from a true value to a zero value to read out the contents of the 
first and second addresses to input terminals of comparison circuits 34 and 36, respectively. 

Comparison circuits 34 and 36 are respectively connected to be responsive to
the contents of the speech segments stored in the first and second addresses of RAM 22 and
the output of dynamic ring buffer 24 to detect the location in the ring buffer of the first and
last speech segments of the particular utterance. In particular, upon the completion of a
particular speech utterance, RAM 22 supplies (1) one input terminal of comparison circuit 34
with a signal indicative of the speech content of the first speech segment of that utterance and
(2) one input terminal of comparison circuit 36 with a signal indicative of the speech content
of the last speech segment of that utterance. 

While RAM 22 is driving comparison circuits 34 and 36 with the signals 
indicative of the speech content of the first and last speech segments of the utterance,
dynamic ring buffer 24 is enabled by the transition at the trailing edge of the bi-level output
of acoustic energy detector 16 to sequentially derive, at a high frequency (i.e., a frequency 
considerably higher than the frequency at which the segments are transduced by microphone 
10), the speech segments it stores. To this end, buffer 24 includes a read out enable input 
terminal 37 connected to be responsive to the trailing edge transition that detector 16 derives.
While enabled for read out, dynamic ring buffer 24 supplies the sequential speech segments
it derives in parallel to second input terminals of comparison circuits 34 and 36. 

Comparison circuit 34 derives a pulse only in response to the speech segment 
that buffer 24 derives being the same as the first segment that RAM 22 supplies to 
comparison circuit 34. Comparison circuit 36 derives a pulse only in response to the speech
segment that buffer 24 derives being the same as the last segment that RAM 22 supplies to 
comparison circuit 36. Gate 38 has first and second control input terminals respectively 
connected to be responsive to the output pulses of comparison circuits 34 and 36 and a data 
input terminal connected to be responsive to the sequential speech segments dynamic ring
buffer 24 derives. Gate 38 is constructed so that in response to comparison circuit 34
supplying the first control input terminal of the gate with a pulse, the gate is opened and 
remains open until it is closed by comparison circuit 36 supplying the second control input 
terminal of the gate with a pulse. 
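
The read-out through gate 38 can be modeled, purely for illustration, as a scan of the buffered segments that starts passing data at the segment matching the stored first segment and stops after the segment matching the stored last segment. In the Python sketch below the function name replay_utterance is hypothetical, and the sketch assumes that the first and last segments can be distinguished by content from the other buffered segments.

```python
def replay_utterance(buffered_segments, first_segment, last_segment):
    """Return the buffered segments from first_segment through last_segment.

    buffered_segments: segments read out of the ring buffer, oldest first
    first_segment:     content stored at the first address (opens the gate)
    last_segment:      content stored at the second address (closes the gate)
    """
    passed = []
    gate_open = False
    for segment in buffered_segments:
        if not gate_open and segment == first_segment:
            gate_open = True  # match corresponding to comparison circuit 34
        if gate_open:
            passed.append(segment)
            if segment == last_segment:
                break  # match corresponding to comparison circuit 36
    return passed
```

For example, replaying the sequence n1, a, b, c, n2 with first segment a and last segment c passes a, b and c to the recognizer and excludes the surrounding noise segments.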

While gate 38 is open, it passes to automatic speech recognizer 40 the first
through the last speech segments dynamic ring buffer 24 supplies to its data input terminal.
Automatic speech recognizer 40 can be of any known type that responds only to signals 
representing acoustic energy and produces an output signal indicative of the speech 
utterances of the speaker talking into microphone 10 while the speaker is being observed by 
video camera 12. The output signal of speech recognizer 40 drives output device 42.
Examples of output device 42 are a computer character generator for driving a computer
display with alphanumeric characters commensurate with the utterances or a machine for 
performing tasks commensurate with the utterances. 

The speech recognition system of Figure 1 can be modified by the 
arrangement illustrated in Figure 2 so that the speech recognition system will not respond to
speech utterances when the speaker is not looking at camera 12 and so that it can respond to
speech utterances and the faces of a plurality of speakers. The apparatus illustrated in Figure 
2 is connected to respond to the output signal of acoustic energy detector 16, Figure 1, and 
replaces lip motion detector 26 and AND gate 28. 

The apparatus of Figure 2 includes face recognizer 50, connected to be
responsive to the output signal of video camera 12, and speaker identity recognizer 52,
connected to be responsive to the output signal of acoustic energy detector 16. Face
recognizer 50 and speaker identity recognizer 52 are connected to other circuit elements and
to speech recognizer 40 so that the speech recognizer is activated only when the speaker is
facing video camera 12, that is, has a predetermined orientation relative to the video camera.
Hence, if the speaker turns away from and is not looking directly into video camera 12
because the speaker is talking to someone and does not desire to have his/her voice
recognized by recognizer 40, recognizer 40 is not activated. Speech recognizer 40 is
activated only if face recognizer 50 and speaker identity recognizer 52 identify the same person. Face
recognizer 50 and speaker identity recognizer 52 are trained during at least one training period to
recognize the face and speech of more than one person, and speech recognizer 40 is activated
only if the face and speech are recognized as being for the same person.

To these ends, speaker identity recognizer 52 includes memory 54 having one 
input connected to be responsive to the speech signal output of analog to digital converter 14
and a second input connected to be responsive to the output of acoustic energy detector 16 so
that memory 54 stores short-term utterances of the speaker while detector 16 derives a true 
value. Upon the completion of the utterance, memory 54 supplies a digital signal indicative 
of the utterance to one input of comparator 56, having a second input responsive to memory 
58 which stores digital signals indicative of the speech patterns of a plurality of speakers who
have trained speech recognizer 40.

Comparator 56 derives a true output signal in response to the output signal of 
speaker memory 54 matching one of the speech patterns that memory 58 stores. Comparator 
56 derives a separate true signal for each of the speakers having a speech pattern stored in 
memory 58. In Figure 2, it is assumed that memory 58 stores speech patterns for first and
second different speakers, whereby comparator 56 includes output leads 57 and 59,
respectively provided for the first and second speakers. In response to comparator 56
recognizing the speaker as having speech characteristics the same as the speech pattern that
memory 58 stores for the first and second speakers, comparator 56 respectively supplies true
values to output leads 57 and 59.

Face recognizer 50 includes memory 60 having an input connected to be
responsive to the output of video camera 12 so that memory 60 stores one frame of an image
being viewed by video camera 12. Upon completion of the frame, memory 60 supplies a 
digital signal indicative of the frame contents to one input of comparator 62, having a second 
input responsive to memory 64 which stores digital signals indicative of the facial patterns of
each of the plurality of speakers; the facial patterns memory 64 stores are derived while the
speakers are looking directly into camera 12, that is, while the faces of the speakers have a 
predetermined orientation relative to the camera. Comparator 62 derives a true output signal 
in response to the output signal of memory 60 matching one of the facial patterns that 
memory 64 stores. Comparator 62 derives a separate true signal for each of the speakers with
facial images stored in memory 64. In the example of Figure 2, memory 64 stores facial
images for the first and second speakers, whereby comparator 62 includes output leads 66
and 68, respectively provided for the first and second speakers. In response to comparator 62
recognizing the speaker as having a facial image the same as one of the facial images that
memory 64 stores for the first and second speakers, comparator 62 respectively supplies true
values to output leads 66 and 68.

During a training period for each of the speakers, each of the speakers recites a
predetermined sequence of words while looking directly into video camera 12.
At this time, speaker memory 54 is connected to an input of memory 58 to cause memory
58 to store speech patterns for each of the plurality of speakers who train speech recognizer
40. At the same time, image memory 60 is connected to an input of memory 64, to cause
memory 64 to store a facial image for each of the plurality of speakers who train speech
recognizer 40. During the training period for each of the speakers, the output of speech
segmentor 18 is supplied to the input of speech recognizer 40 to enable the speech recognizer
to learn the speech patterns of each of the speakers, in a manner known to those skilled in the
art. 

The output signals of comparators 56 and 62 on leads 57 and 66 are supplied 
to inputs of AND gate 70, while the output signals of the comparators on leads 59 and 68 are 
supplied to inputs of AND gate 72. Hence, AND gate 70 derives a true value only in
response to face recognizer 50 and speaker identity recognizer 52 both recognizing that a
speaker is the first speaker who is looking directly into camera 12. Similarly, AND gate 72
derives a true value only in response to face recognizer 50 and speaker identity recognizer 52
both recognizing that a speaker is the second speaker who is looking directly into camera 12.
AND gates 70 and 72 derive bi-level signals that are supplied to OR gate 74, which derives a
true value in response to either the first or the second speaker being identified from the voice
and facial characteristics thereof.
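
The logic formed by AND gates 70 and 72 and OR gate 74 can be summarized by the following Python sketch, in which each speaker has one AND term and the activation output is the OR of those terms. The function name activation_signal and the dictionary representation of the comparator output leads are assumptions made for illustration only.

```python
def activation_signal(voice_matches, face_matches):
    """Combine per-speaker voice and face matches.

    voice_matches: dict mapping speaker label to bool (models leads 57 and 59)
    face_matches:  dict mapping speaker label to bool (models leads 66 and 68)
    Returns (or_output, per_speaker), where per_speaker holds one AND result
    per speaker and or_output is True when any speaker matched on both cues.
    """
    per_speaker = {
        speaker: bool(voice_matches.get(speaker, False))
        and bool(face_matches.get(speaker, False))
        for speaker in voice_matches
    }
    return any(per_speaker.values()), per_speaker
```

A voice match for one speaker together with a face match for a different speaker leaves every AND term false, so the recognizer is not activated in that case.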

The output signal of OR gate 74 drives one shot circuits 30 and 32 in the same manner that the
output of AND gate 28 drives them. Consequently, the speech signal of the first or
second speaker is supplied to speech recognizer 40 in the same manner that the speech signal
is supplied to speech recognizer 40 in the embodiment of Figure 1.

To enable speech recognizer 40 of Figure 2 to recognize both speakers, the 
outputs of AND gates 70 and 72 are supplied to speech recognizer 40. Speech recognizer 40 
responds to the outputs of AND gates 70 and 72 to analyze the speech of the correct speaker, 
in a manner known to those skilled in the art.

While there has been described and illustrated a specific embodiment of the
invention, it will be clear that variations in the details of the embodiment specifically 
illustrated and described may be made without departing from the true spirit and scope of the 
invention as defined in the appended claims. For example, the discrete circuit elements can 
be replaced by a programmed computer.

1. A speech recognition system comprising an acoustic detector (10) for
detecting speech utterances of a speaker; a visual detector (12, 26) for detecting at least one
facial characteristic associated with speech utterances of the speaker; a processing
arrangement (16, 18, 22, 24, 28, 30, 32, 34, 36, 38) connected to be responsive to the acoustic
and visual detectors for deriving a signal having first and second values respectively
indicative of the speaker making and not making speech utterances such that the first value is
derived in response to the acoustic detector detecting a finite, nonzero acoustic response
while the visual detector detects at least one facial characteristic associated with speech
utterances of the speaker; and a speech recognizer (40) for deriving an output indicative of
the speech utterances as detected only by the acoustic detector, the speech recognizer being
connected to be responsive to the acoustic detector in response to the signal having the first
value.

2. The speech recognition system of claim 1 wherein the processing arrangement
causes the signal to have the second value in response to any of: (a) the acoustic detector not
detecting a finite, nonzero acoustic response while the visual detector does not detect speech
utterances of the speaker, (b) the acoustic detector detecting a finite, nonzero acoustic
response while the visual detector does not detect speech utterances of the speaker, and (c)
the acoustic detector not detecting a finite, nonzero acoustic response while the visual
detector detects speech utterances of the speaker.

3. The speech recognition system of claim 1 or 2 wherein the processing 
arrangement includes a delay arrangement (22, 24, 34, 38) for assuring that the beginning of 
each speech utterance is coupled to the speech recognizer. 

4. The speech recognition system of any of claims 1-3 wherein the processing 
arrangement includes a delay arrangement (22, 24, 36, 38) for assuring that in response to 
completion of each speech utterance the acoustic detector is decoupled from the speech 
recognizer.

5. The speech recognition system of claim 3 or 4 wherein the delay arrangement
includes a memory element (24) connected to be responsive to the acoustic detector, the
memory element including a plurality of stages for storing sequential segments of the output
of the acoustic detector, the delay arrangement being such that the contents of the memory
element stage storing the beginning of a speech utterance are initially coupled to the speech
recognizer.

6. The speech recognition system of claim 5 wherein the memory element
includes a ring buffer (24).

7. The speech recognition system of any of claims 1-6 wherein the processing
arrangement includes a face recognizer (60, 62, 64) connected to be responsive to the visual 
detector. 

8. The speech recognition system of claim 7 wherein the face recognizer is 
arranged for enabling the signal to have the first value in response to the face of the speaker 
being at a predetermined orientation relative to the visual detector. 

9. The speech recognition system of claim 7 or 8 wherein the face recognizer is
arranged for: (1) detecting and distinguishing the faces of a plurality of speakers, and (2)
enabling the signal to have the first value in response to the speaker having a recognized face.

10. The speech recognition system of any of claims 1-9 wherein the processing
arrangement includes a speaker identity recognizer (54, 56, 58) connected to be responsive to
the acoustic detector, the speaker identity recognizer being arranged for: (1) detecting and
distinguishing speech patterns of a plurality of speakers, and (2) enabling the signal to have
the first value in response to the speaker having a recognized speech pattern.

11. The speech recognition system of any of claims 1-10 wherein the processing
arrangement is arranged for causing the signal to have the first value in response to the
speaker having a recognized face matched with a recognized speech pattern of the same
speaker.